Structural alignment of plain text books
Abstract
Text alignment is one of the main processes for obtaining parallel corpora. When aligning two versions of a book, results are often affected by unpaired sections, that is, sections which exist in only one of the versions. We developed Text::Perfide::BookSync, a Perl module which performs book synchronization (structural alignment based on section delimitation), provided the books have previously been annotated by Text::Perfide::BookCleaner. We discuss the need for such a tool and several implementation decisions. The main functions are described, and examples of input and output are presented. Text::Perfide::PartialAlign is an extension of the partialAlign.py tool bundled with hunalign which proposes an alternative method for splitting bitexts.
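To make the idea of unpaired sections concrete, here is a minimal sketch in Python. It is not the Text::Perfide::BookSync algorithm or API. It assumes each book has already been reduced to an ordered list of normalized section labels (the kind of information an annotation step could provide), and it uses the standard library's difflib.SequenceMatcher to decide which sections pair up. The function name sync_sections and the sample labels are hypothetical.

```python
from difflib import SequenceMatcher


def sync_sections(left, right):
    """Pair section labels from two versions of a book.

    Returns a list of (left_label, right_label) tuples. None on either
    side marks an unpaired section that exists in only one version.
    Assumption: equivalent sections carry identical labels.
    """
    pairs = []
    matcher = SequenceMatcher(a=left, b=right, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Matching runs of sections are paired one to one.
            pairs.extend(zip(left[i1:i2], right[j1:j2]))
        else:
            # 'delete', 'insert' or 'replace': report both sides as unpaired.
            pairs.extend((label, None) for label in left[i1:i2])
            pairs.extend((None, label) for label in right[j1:j2])
    return pairs


# Hypothetical section labels for two versions of the same book.
left = ["preface", "chapter 1", "chapter 2", "chapter 3"]
right = ["chapter 1", "chapter 2", "translator's note", "chapter 3"]
result = sync_sections(left, right)
```

In this sketch, only the paired sections would be passed on to a sentence aligner, while unpaired ones such as a preface or a translator's note would be set aside so they do not disturb the alignment.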
Related articles
Digital Talking Books in Multiple Languages and Varieties
This paper describes our work in digital talking book alignment, starting with our earlier efforts for the alignment of books in European Portuguese, and ending with the two challenges we are currently facing: aligning books in different varieties of Portuguese and aligning parallel books in different languages. Our alignment module proved robust enough for porting to other varieties of Portugu...
Structural dynamics in northern Atlas of Tunisian, Jendouba area: insights from geology and gravity data
This paper presents a new interpretation of the geometry of the Triassic alignment of J. Sidi Mahdi – J. Zitoun in the Medjerda Valley Plain (Northern Tunisia) based on detailed analysis of gravity and seismic reflection data. The main results of the gravity analysis do not show a distinct gravity anomaly over the Triassic evaporite bodies. The positive gravity anomaly seems to be related to the entire stru...
Towards a repository of digital talking books
Considerable effort has been devoted at to increase and broaden our speech and text data resources. Digital Talking Books (DTB), comprising both speech and text data are, as such, an invaluable asset as multimedia resources. Furthermore, those DTB have been under a speech-to-text alignment procedure, either word or phone-based, to increase their potential in research activities. This paper thus...
Technique for automatic sentence level alignment of long speech and transcripts
A frugal approach to constructing speech corpora, especially for resource-deficient languages, is to exploit collections of speech and corresponding text data available in audio books, news, and lectures. However, using these resources for building speech corpora requires an alignment of the long-duration speech data with the accompanying text data. Existing techniques for automatic speech-text alignmen...
The Speect text-to-speech system entry for the Blizzard Challenge 2013
This paper describes the Speect text-to-speech system entry for the Blizzard Challenge 2013. The techniques applied for the tasks of the challenge are described as well as the implementation details for the alignment of the audio books and the text-to-speech system modules. The results of the evaluations are given and discussed.